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Abstract. A method for the construction of approximate analytical expressions for 
the stationary marginal densities of general stochastic search processes is proposed. By 
the marginal densities, regions of the search space that with high probability contain 
the global optima can be readily defined. The density estimation procedure involves 
a controlled number of linear operations, with a computational cost per iteration that 
grows linearly with problem size. 
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1. Diffusion and Global Optimization 

Stochastic strategies for optimization are essential to many of tlie lieuristic tectiniques 
used to deal witli complex, unstructured global optimization problems. Methods like 
simulated annealing [H El [3l H] and evolutionary algorithms [HI El [7] , have proven to be 
valuable tools, capable of give good quality solutions at a relatively small computational 
effort. In spite of their success, these approaches present a major drawback, namely the 
absence of valid bounds on the obtained solutions. A common feature of deterministic 
global optimization algorithms is the progressive reduction of the domain space until the 
global optimum has been found with arbitrary accuracy [HI |9] . An analogous property 
for stochastic algorithms has been largely lacking. In this contribution is introduced a 
method for the estimation of the asymptotic probability density of a general stochastic 
search process in global optimization problems. The convergence of the estimated 
density can be clearly assessed, and with the help of this density, reliable bounds for the 
location of the global optimum are derived. The procedure involves linear operations 
only, and a well defined number of evaluations of the given cost function. The presented 
results indicate that by the proposed approach, regions of the search space can be 
discarded on a probabilistic basis. This property may be implemented in a variety of 
ways in order to improve existing or develop new optimization algorithms, and open the 
door for the construction of probabilistic optimality certificates in large scale nonlinear 
optimization problems. 

The roots of stochastic search methods can be traced back to the Metropolis 
algorithm [10], introduced in the early days of scientific computing to simulate the 
evolution of a physical system to thermal equilibrium. This process is the base of 
the simulated annealing technique [I], which makes use of the convergence to a global 
minimum in configurational energy observed in physical systems at thermal equilibrium 
as the temperature goes to zero. The method presented in this contribution is rooted in 
similar physical principles as those on which simulated annealing and related algorithms 
[H [HI El E] are based. However, in contrast with other approaches, the proposed 
method considers a density of points instead of Markov transitions of individual points. 
Moreover, the main goal of the proposed approach is not the convergence to global 
minima as a randomness parameter is reduced, but the approximation of the probability 
density after an infinitely long exploration time of the search space, keeping a fixed 
randomness. 

Consider the minimization of a cost function of the form ^(xi, X2, Xat) 
with a search space defined over Li^n < a^n < -Z^2,n- A stochastic search process for this 
problem is modeled by 



*r^ = -^ + ^W, (1) 

where e{t) is an additive noise with zero mean. Equation ([T]), known as Langevin 
equation in the Statistical Physics literature [121 [13], captures the basic properties 
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of a general stochastic search strategy. Under an uncorrelated Gaussian noise with 
constant strength, Eq. ([1]) represents a search by diffusion, while a noise strength that 
is slowly varying in time gives a simulated annealing. Notice that when choosing an 
external noise of infinite amplitude, the dynamical influence of the cost function over 
the exploration process is lost, leading to a blind search. The model given by Eq.([l]) 
can be also interpreted as an overdamped nonlinear dynamical system composed by N 
interacting particles. The temporal evolution of the probability density of such a system 
in the presence of an additive Gaussian white noise, is described by a linear differential 
equation, the Fokker - Planck equation [121 [13] . 
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where D is a constant, called diffusion constant, that is proportional to the noise 
strength. The direct use of Eq. ([2]) for optimization or deviate generation purposes 
would imply the calculation of high dimensional integrals. It results numerically much 
less demanding to perform the following one dimensional projection of Eq. ([2]). Under 
very general conditions (e. g., the absence of infinite cost values), the equation ([2]) 
has a stationary solution over a search space with reflecting boundaries [T^ [T3]. The 
stationary conditional probability density satisfy the one dimensional Fokker - Planck 
equation 



dp{Xn\{Xj^n = X*}) dV 
D ^ h p{Xn\{Xj^n = Xj})-— = 0. (3) 

An important consequence of Eq. Q is that the marginal can be sampled by 

drawing points from the conditional p{xn\{xj-Ln = x*}) via a Gibbs sampling [15j. It 
is now shown how, due to the linearity of the Fokker - Planck equation, a particular 
form of Gibbs sampling can be constructed, such that its not only possible to sample 
the marginal density, but to give an approximate analytical expression for it. From Eq. 
([2]) follows a linear second order differential equation for the cumulative distribution 

y{Xn\{Xj^n = X*}) = J'iloPi^'nliXj^n = X*})dx'^, 



(Py I dV dy ^ 

— - H — = 0, (4) 

= 0, y{L2,n) = 1- 

Random deviates can be drawn from the density p(a:„|{xjy„ = x*}) by the fact that 
y is an uniformly distributed random variable in the interval y G [0,1]. Viewed as a 
function of the random variable Xn, y{xn\{xj-tn}) can be approximated through a linear 
combination of functions from a complete set that satisfy the boundary conditions in 
the interval of interest. 
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L 

y{Xn\{Xj^n}) =^anpi{Xn). (5) 
1=1 

Choosing for instance, a basis in which = 0, the L coefficients are uniquely defined 
by the evaluation of Eq. (jl]) in L — 1 interior points. In this way, the approximation of y 
is performed by solving a set of L linear algebraic equations, involving L — 1 evaluations 
of the derivative of V. 

The proposed procedure is based on the iteration of the following steps: 

1) Fix the variables Xj^n = x* and approximate y{xn\{xj^n}) by the use of formulas 
dl and dS]). 

2) By the use of y{xn\{xj^n}) construct a lookup table in order to generate a deviate 
drawn from the stationary distribution p{xn\{xjjLn = ^j})- 

3) Update * and repeat the procedure for a new variable Xj^n- 

The iteration of the three steps above give an algorithm for the estimation of the 
equilibrium distribution of the stochastic search process described by Eq. ([1]). A 
convergent representation for is obtained after taking the average of the coefficients 
a's in the expansion ([5]) over the iterations. In order to see this, consider the expressions 
for the marginal density and the conditional distribution, 

p{Xn) = J piXn\{Xj^n})pi{Xjjtn})d{Xj^n}, (6) 



pixj{xj^^r.})dx^. (7) 

From the last two equations follow that the marginal y{xn) is given by the expected 
value of the conditional y{xn\{xjjLn}) over the set {xjy„}. 



y{Xn) = E{^^^^y[y{Xn\{Xj^n})]- (8) 

All the information on the set {xj^^n} is stored in the coefficients of the expansion ([5]). 
Therefore 

L 

{y) = J2i^i)'fiM ^y{xn), (9) 
1=1 

where the brackets represent the average over the iterations of the density estimation 
procedure. 

Previous preliminary applications of the density estimation method on the 
generation of suitable populations of initial points for optimization algorithms can 
be found in [16]. In the next section the capabilities of the proposed algorithm 
for the construction of reliable probabilistic bounds is tested on several benchmark 
unconstrained examples and in a family of well known constrained NP-hard problems. 
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2. Examples 

The fundamental parameters of the density estimation procedure, L and D, have a clear 
meaning, which is very helpful for their selection. The diffusion constant "smooth" the 
density. This is evident by taking the limit D ^ oo in Eq. (jlj), which imply an uniform 
density in the domain. The number of base functions L, on the other hand, defines the 
algorithms capability to "learn" more or less complicated density structures. Therefore, 
for a given D, the number L should be at least large enough to assure that the estimation 
algorithm will generate valid distributions y{xn\{xj-^n})- A valid distribution should be 
a monotone increasing continuos function that satisfy the boundary conditions. The 
parameter L ultimately determines the computational cost of the procedure, because at 
each iteration a system of size oc L of linear algebraic equations must be solved times. 
Therefore, the user is able to control the computational cost through the interplay of 
the two basic parameters: for a larger D a smoother density should be estimated, so a 
lesser L can be used. 

The density estimation algorithm is tested on the following benchmark 
unconstrained problems: 

Schwefel: 

^ I 

N = 6, f = 418.9829N -Y.^nSin{^\xn\), 

n=l 

-500 <Xn< 500, solution : x* = (420.9687, ...,420.9687), f{x*) = 0. 
Levy No. 5: 

5 5 

N = 2 f = Y^zcosiii - + z) ^ jcos(0- + 1)X2 + j) 

i=l j=l 
+ (xi + 1.42513)^ + (X2 + 0.80032)^ 

-10 <Xn< 10, solution : x* = (-1.3068, -1.4248), /(a;*) = -176.1375. 

Booth: 

N = 2, / = (si + 2x2 - 7f + (2x1 + X2- 5)2, 
-10 <Xn< 10, solution : x* = (1, 3), /(x*) = 0. 

Colville: 

AT = 4, / = 100(x2 - xi)' + (1 - xi)' + 90(x4 - X3)' + (1 - X3)' 

+ 10.1((X2 - 1)' + (X4 - 1)') + 19.8(X2 - 1)(X4 - 1), 

— 10 < XnlO, solution: x* = (1, 1), /(x*) = 0. 
Rosenbrock: 

N 

N = 20, / = ^100(x„+i-x2)2 + (x„-l)2, 

n=l 

-10<x„<10, solution: x* = (1, 1), /(x*) = 0. 
For the experiments, the following specific form of the expansion has been used. 
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y = ^aisin {21 - V 




(10) 



so the size of the algebraic system of equations is L— 1. The hnear system has been solved 
by a LU decomposition routine [T7]. The gradients have been calculated numerically, 
with two cost function evaluations per derivative. In this way, the total number of cost 
function evaluations per iteration goes like 2{L — 1)N. 

In Fig. [T]two different pairs L, D have been considered in the study of the Schwefel 
problem. This problem has a second best minimum at a relatively large distance of the 
global optimum. This is reflected on the estimated densities, but at small D a clear 
distiction between the two regions is made. The Schwefel problem is an example of a 
separable function, that is, a function given by a linear combination of terms, where each 
term involves a single variable. Separable problems generate an uncoupled dynamics 
of the stochastic search described by Eq. ([1]). Because of this fact, the estimation 
algorithm converges in only one iteration for separable problems. The Schwefel example 
also illustrates that the density estimation algorithm works well on functions that are 
not derivable in some points. This is a consequence of the finite number of gradient 
evaluations required by the procedure. 



■ ■ ■ D = 3U0 
— D = 50 



Figure 1. Density estimation for the variable xi of the Schwefel function. Two 
different diffusion constants have been considered, taking L = 100 in both cases. In 
the two cases the density clearly represents the structure of the cost function. The 
density is sharply peaked around the optimal value for the lesser D. 



In contrast to the Schwefel function, the stochastic search process associated to the 
Levy No. 5 problem represents a coupled nonlinear dynamics. Despite that this problem 
has about 760 local minima [18], the estimation algorithm shows good convergence 
properties, as is illustrated in Fig. O 

The two previous examples show how, once that the estimation algorithm as 
attained convergence, is possible to define a region of the search space in which with high 
probability the global optimum is located. This concept is more sistematically studied 



7 




Figure 2. Probability densities associated to the Levy No. 5 problem, using the 
parameter values L — 200, D ~ 70 and M — 300. The densities maxima are at 
coordinates (—1.3,-1.42). 



through the introduction of normahzed distances. The distance normahzed with respect 
to the search space Li^„ < a^n < -^2,n between two points x and x* is defined by 



distance 



(xi - xiy + ... + (xn - x*j^y 

\ (Li,i-L2,i)2 + ... + (Li,^-L 



2,N 



Two measures written in terms of normahzed distances are presented in the examples 
of Figures [3], H] and O i) The distance between the global optimum and the point in 
which the density is maximum, ii) The length of the 95% probability interval around 
the point of maximum probability. 



n ' I ' I ' 



0.001 



■ ■ ■ 95 per cent interval 

— Distance to optitnuni. Booth problem 




D= 10. L= 100 







■ 95 per cent interval 

■ Distance to optimum, Booth problem 



D= 1,L = 200 




500 1000 



1500 
M 



2000 2500 3000 



Figure 3. Density estimation for the Booth problem. Semi ~ log scale has been used. 



Under general conditions a Gibbs sampling displays geometric convergence |19j . 
In the presented experiments the running time has been chosed such that the 95% 
probability interval differ in less than 0.01 between succesive iterations. Additionally, 
several control runs from different and independent starting conditions have been 
performed, indicating convergence to the same corresponding region within the 
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D = 300, L = 300 



■ 95 per cent interviil 

■ Distance to (jptimum, Ct}lville problem 



500 
M 



Figure 4. Density estimation for the Colvillc problem. 



D= 1000, L= 100 



■ 95 per cent interval 

■ Distance to optimum, Rosenbrock problem 



400 600 
M 



Figure 5. Density estimation for the Rosenbrock problem. 



predefined accuracy. As expected for a Gibbs sampling, tlie density appears to contract 
to a region of space that is independent of the staring point [20]. The numerical 
realizations suggest that convergence is attained at a few hundreds of iterations for 
all of the examples, even for the 20 - dimensional Rosenbrock problem, which as been 
reported to be difficult to solve by stochastic heuristics like genetic algorithms [2T] . 

The numerical experiments show that the global optimum is contained in a region 
close to the point of maximum probability, and that this region gets more sharp as 
D decreases. A straighforward application of this behavior would be, for instance, on 
simulated annealing - type algorithms. Starting with a large diffusion constant, search 
regions can be iteratively discarded. By the use of the density estimation method, a 
probability measure is associated with each region. In this way the user can define 
a certain level of precision in the search. Several statistical quantities can be readily 
calculated like measures of confidence, for instance probability intervals or characteristic 
fiuctuation sizes. 

Because the estimation algorithm depends on linear operations only, additional 
nonlinearities in the cost function can be treated with essentialy the same efficiency, 
giving more freedom and fiexibility in modeling. For instance, the application of the 
density estimation algorithm to constrained problems can be done in a very direct 
manner through the addition of suitable "energy barriers" (or more precisely, "force 
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barriers"). These barriers don't need to be very large. Their main purpose is not to 
define prohibited regions, but only regions with low probability. The original constrained 
problem is transformed to an unconstrained cost function with additional nonlinearities. 
Of course, the design of adequate barriers may be a difficult problem - dependent task. 
However, at least for some problems the approach seems to be straightforward. This is 
illustrated on the classical NP-hard knapsack problem [22]. It is well known that many 
standard instances of the knapsack model can be efficiently solved by exact methods 
[22], which makes it an ideal example for experimentation with the density estimation 
algorithm. The knapsack problem is formulated as 



N 

min - ^ qnXn (12) 

n=l 

N 

S.t. WnXn < C, 

n=l 

-xl + Xn<0, 0<Xn<l, 

where Wn and c are positive numbers. The quadratic constraint is equivalent to the 
usual restriction to binary variables. The following transformation is proposed, 

N 

min - QnXn (13) 
n=l 

N 1 



n=l ^ 



+ exp{-bo[-xl + Xn]) 

exp (bl[En=l '^nXn " c]) - 1 



exp (-62[E^=l '^nXn " c] 

S.t. < x„ < 1, 

For illustrative purposes, consider the instance q = (2,3,5), w = (3,5,7), c = 10 of 
the knapsack problem. By inspection, the solution is given by a; = (1,0,1). In Fig. 
[6] typical densities produced by the estimation algorithm for this instance are shown. 
The selection of the parameters has been done after the performance of short runs, 
measuring the effects of each of the nonlinear terms. The parameters have been tuned 
such that: a) The first nonlinear term alone produce symmetric densities peaked in the 
neighborhood of {0, 1}. b) The addition of the second nonlinear term and the original 
cost function generate densities in which the configurations with maximum probabilty 
satisfy the Yln=i ^nXn < c constraint. Notice that the configuration that corresponds 
to the global optimum has the maximum probability. 

A larger example is presented in Fig. [71 The exact solution has been calculated 
with the branch and bound algorithm supplied in the GNU Linear Programming Kit 
[23] . The instance has been generated by taking Wn uniformily distributed in an interval 
[1, i?], and 



in 




gn = W^n + (£«-!)— + —, (14) 

En uniform deviate in [—1,1], 

which imply strong hnear correlations between Qn and Wn- Instances of this kind 
are relevant to real management problems in which the return of an investment is 
proportional to the sum invested within small variations [22]. It has been argued that 
these type of instances fall in a category which is close to the "worst case" scenario for 
exact algorithms [22]. Despite of that, the estimation method converge to densities in 
which the optimum is contained in a region with high probability. Moreover, from the 
definition of the normalized distance, follows that the closest integers to the elements of 
the vector that represent the point of maximum probability differ in ~ 1 positions from 
the exact solution. 



■ ■ ■ 95 per cent interval 
Distance to optimum 




50 100 150 200 250 300 

M 



Figure 7. The density estimation of an instance of a knapsack problem with 30 
variables. 



Three independent realizations of the numerical experiment over instances 
generated by Eq. ( |T4l) have been performed, varying the orders of magnitude of R 
and c. The results are summarized in Table 1, indicating the number of ^ 1 flips that 
the normalized distances imply. 
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Table 1. Density estimation on instances of the knapsack problem with 30 variables. 
The final two columns indicate the resulting 95% intervals and gaps to optimum like 
normalized distances/flips. 



R 


c 


ko 


h 


bo 


bi 


&2 


M 


L 


D 


Interval 


Gap 


10 


100 


10 


7.1 


10 


0.01 


2bi 


300 


100 


1 


0.425/5 


0.185/1 


100 


500 


100 


37.0 


10 


0.001 


3bi 


300 


100 


15 


0.432/6 


0.363/4 


1000 


3000 


1000 


315.0 


10 


0.0001 


3bi 


300 


100 


100 


0.433/6 


0.257/2 



It should be remarked that, although the instances of the knapsack model discussed 
in this section are quickly solvable by exact algorithms, no reference to the particular 
structure of the original problem has been used for the density estimation. In fact, the 
problem has been treated like a highly nonlinear cost function of 30 variables. 

3. Heuristics Based on Stationary Density Estimation 

The potential benefits of the stationary density estimation algorithm as a tool for the 
construction of new heuristics for high dimensional global optimization problems is 
illustrated through the following greedy random search procedure: 

1) Run an iteration of the stationary density estimation algorithm. 

2) An initial best point is given by the global maximum of the estimated density 

3) Define a population of + 1 points. One point is the current best solution and other 
is the current point with maximum probability density. The remaining — 1 points 
are randomly drawn from an uniform distribution centered around the best point. For 
each dimension, the corresponding uniform distribution has a length equal to the typical 
fluctuation size given by the estimated density. 

4) Run a downhill simplex routine. The starting conditions are given by the simplex 
defined by the points generated at step 3 as vertices. 

5) If from step 4 results a point which improves the best known objective value, then 
update the best point. 

6) Run an iteration of the density estimation algorithm. 

7) Go to step 3. 

From an evolutionary perspective, the above procedure acts at two different levels. 
At a short time scale finite populations evolve locally in a very greedy fashion. On 
the other hand, large changes on the population composition are dictated by a long 
time scale dynamics, which is consistent with the learned information about the global 
cost landscape. This information is gained through the approximation of the long term 
statistical density of a diffusive search process. 

The short time scale exploration of the solution space is dominated by the downhill 
simplex method, which is a deterministic search based on function evaluations of 
a simplex vertices [24j. In a A^ dimensional search space, a population of A + 1 
points defined by the corresponding A^ + 1 vertices evolve under simple geometric 
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transformations, namely reflection, expansion and contraction. At each iteration, a 
new trial point is generated by the image of the worst point in the simplex (reflection). 
If the new point is better than all other vertices, the simplex expands in its direction. 
If the trial point improves the worst point, a reflection from the new worst point is 
performed. A contraction step is made when the worst point is at least as good as the 
reected point. In this way the simplex eventually surrounds a local minimum. In the 
experiments presented here, the implementation of the downhill simplex given by [17] 
has been used, with a fractional decrease of cost value of at least lOe — 4 as termination 
criteria. A maximum of 50000 function evaluations in the downhill simplex routine is 
allowed. 

From a given reference point x^'"^^^\ an initial simplex for each call to the 
downhill simplex routine is defined through the characteristic length scales A„, as 
rf,ii) — ^{best) _|_ where the are unit vectors. The estimated long term density 
provides a vertex (the point with maximum likelihhood at the current stage) and most 
importantly, typical fluctuation sizes, denoted by an- These are given by the first two 
moments of the estimated density. 



Due to the simple form of the expansion of the estimated density, all the integrals 
over the variables domain that are needed for moment calculation are performed 
analitically. The resulting expressions are finite sums with L terms. 
The typical fluctuation sizes (ITS!) provide a natural definition for the characteristic length 
scales A„, in the sense expressed in step (3) of the greedy diffusive search described above. 

For illustration purposes, consider the Rosenbrock test function. In FigjH] are 
presented some plots of the beahavior of our greedy stochastic search for the Rosenbrock 
problem of A = 20 variables. The graphs represent the cost function values over 
successive iterations. Four samples from a total of 100 runs are plotted. The result 
of a version of the algorithm in which an uniform density over the search space is used 
instead of the estimated long term density is also plotted. The success in finding the 
optimum is defined by a gap size with the known global optimum lesser than 0.001. 
Each run consist of 100 iterations. Over the total number of runs of the greedy 
stochastic search, 90% have been successful, and the 100% of the runs outperform 
the search based on uniform distributions. An 80% of the runs have been successful 
in less than 48 iterations and 30% in less than 13 iterations. For the 13 iterations 
cases, an average number of 28080 cost function evaluations was needed. It should be 
remarked that the 20-dimensional Rosenbrock test function has been reported to be 
extremely difficult to solve by randomized optimization algorithms. For instance, in an 
experiment similar to the one presented here, it has been reported in [25] a success rate 
of zero after 400000 function evaluations for Simulated Annealing, Cross-Entropy and 
Model Reference Adaptive Search. On the other hand, in [21] is reported that Genetic 
Algorithms and Scatter Search methods are unable to succeed in the 20-dimensional 




(15) 
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Rosenbrock problem after 50000 cost function evaluations. 
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Figure 8. Greedy stochastic search for the Rosenbrock problem. The parameter 
values of the density estimation are L = 30 and D = 10000. 

A more exhaustive experimentation with possible heuristics based on the stationary 
density estimation algorithm is in progress. 

4. Conclusion 

The presented results strongly suggest that the proposed density estimation algorithm 
can be used to construct probabilistic bounds on the location of global optima for 
large classes of problems. The density estimation is performed in a well defined 
number of elementary operations. The developed theory and the numerical experiments 
indicate that any given desired precision on the bounds can be attained with some 
finite values of the basic parameters, performing a finite number of iterations. The 
total computational cost per iteration grows linearly with problem size. The algorithm 
estimates the marginal density of each separate variable, which makes it suitable for 
parallel implementation. These features make the proposed method a promising tool, 
opening the possibility of constructing probabilistic optimal certificates for large scale 
unstructured problems. Experimentation in this direction is in progress. On the other 
hand, the density estimation algorithm may be used to develop new heuristics or improve 
existing stochastic or deterministic algorithms. 
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